Introduction to Open Data Science - Course Project

About the project

This course - Introduction to Open Data Science - makes use of the open research tools with open data and how to apply data-driven statistics.

Link to my project: https://github.com/SusannaSuntila/IODS-project

date()
## [1] "Fri Dec  2 20:14:40 2022"

Thoughts about the course

I am excited and I only hope that I can keep up with the pace and content of this course, but it all sounds very interesting. I have taken some courses in R, but I have not used it professionally or for that long. I want to become even more familiar in using R and I want to deepen my skills in visualizing, sharing, wrangling, testing and analyzing data. I heard about this course in my previous course for research skills.


Learning experience with the R book and Exercise set 1

  • The first exercise was a good introduction, as I was already a bit familiar with the book, but now I could focus on the parts that were more difficult for me, or that I hadn’t really used before.
  • Some more simple functions, like filter() or group_by() are something that I will try to make use of even more.
  • In my last R-course I was introduced to ggplot, and drawing plots and fine-tuning them was definitely my favorite part of this exercise. I hope to develop my skills on that score during this course.
  • Working with factors was also part of the book that I liked, and there were functions that I hadn’t used, like summary_factorlist.
  • I haven’t really used the tidy() function to tidy up the results of a statistical tests, so reading these chapters was good reminder of that option as well.
  • I haven’t done any reshaping of data with regards to the format, especially tidying from a wide to a long format. That was a more difficult part of the book for me, but it is definitely something I want to practice.
  • Different statistical tests and plotting data in order to choose the right test was also a more demanding part for me and something that I want to focus on during this course.
  • Many of the functions and approaches used from the finalfit-package were more difficult as well, as I haven’t used much of those.
  • I like the so called active reading, as it helps to understand better what is happening and also enables to try some changes to the code myself. I think the book is clear and well constructed.
  • I have a lot to learn about using R-markdown, so this has been a good start. There has been quite a lot of new concepts in the first week.

Regression and model validation

date()
## [1] "Fri Dec  2 20:14:40 2022"

Read in the data and describe it

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2
## Warning: package 'purrr' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# the data-set that I created
learning2014 <- read.csv(file.path(".", "data", "learning2014.csv"))
# Look at the dimensions of the data
dim(learning2014)
## [1] 166   7
# Look at the structure of the data
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

There are 166 rows (observations) and 7 columns (variables) in this data. The gender variable is a character, the other variables are numbers or integers.

The original data is from a questionnaire that tracked different learning approaches and achievements on an introductory statistical course at university level. Gender and age are self-explanatory. The points variable tells how many exam points the person had. The attitude variable is a sum of 10 questions related to student’s attitude towards statistics, in a 1-5 scale. The remaining variables deep, stra and surf are combination variables that have been combined from multiple questions with the same dimension. The variables are averages of the selected questions concerning deep, surface and strategic learning.

Graphical overview of the data and summaries of the variables

library(ggplot2)

# set the theme white
theme_set(theme_bw())


# plot of student's attitude and points
ggplot(learning2014, aes(x = attitude, y = points, col = gender)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_colour_manual(values = c("lightpink2", "maroon")) +
  labs(title = "Student's attitude compared to exam points",
       subtitle =  "learning2014 data")
## `geom_smooth()` using formula 'y ~ x'

library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])

# add color by gender
gender_col <- c("lightpink2", "maroon")[unclass(factor(learning2014$gender))]

pairs(learning2014[-1], pch = 19, col = gender_col)

# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# with color by gender
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20))) +
  scale_colour_manual(values = c("lightpink2", "maroon")) +
  scale_fill_manual(values = c("lightpink2", "maroon"))

First, there are clearly more women who have participated in this questionnaire than men. When looking at the distribution of the variables, it is clear that age is concentrated below thirty, with a long tail towards right. With the age variable, the skew is larger than with any other variable. There is also more variation between men compared to women. Interestingly with the attitude variable, there is more variation with women compared to men. Men also have slightly higher mean, and their distribution is more skewed. The deep variable has a longer tail on the left side, so the observations are more concentrated on the higher side. The difference between genders is very small in this case. The stra variable has a wide distribution. The surf variable is more concentrated among women compared to men. The point variable has a thick tale among lower scores and the distribution is more concentrated among higher scores.

The largest and most significant correlation can be found with attitude and points. It is positive, so better attitude is related to better points in the exam. The lowest correlation is between deep learning and points from the exam which is interesting as well. Surf variable is also significantly correlating with attitude and deep.

Regression model with three explanatory variables

library(GGally)
library(ggplot2)

# draw a plot of the linear relation of exam points and attitude
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# create a regression model with multiple explanatory variables
model1 <- lm(points ~ attitude + gender + stra, data = learning2014)

# print out a summary of the model
summary(model1)
## 
## Call:
## lm(formula = points ~ attitude + gender + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7179  -3.3285   0.5343   3.7412  10.9007 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9798     2.4030   3.737 0.000258 ***
## attitude      3.5100     0.5956   5.893 2.13e-08 ***
## genderM      -0.2236     0.9248  -0.242 0.809231    
## stra          0.8911     0.5441   1.638 0.103419    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.304 on 162 degrees of freedom
## Multiple R-squared:  0.2051, Adjusted R-squared:  0.1904 
## F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08

In this model the dependent variable is exam points as instructed. The three explanatory variables that I chose are attitude, gender and stra. In the summary of the model it can be seen that only the intercept and the variable attitude are statistically significant, so attitude is the only variable strongly associated with exam points in this particular model. The coefficient on attitude, 3.5, tells that one unit change in attitude is related to a 3.5 point increase in exam points, conditional on other variables of this model. The gender variable is a dummy variable that has the value 1 when the person’s gender is male, so the coefficient of the gender variable describes the difference between the two genders, and when the person is male, it lowers the exam points by -0.22, again conditional on the other variables of this model. The stra variable that is computed from strategic questions has a coefficient of 0.89, meaning that better strategic learning will increase the exam points. The F-test that all three coefficients would be zero has a very small p-value, so it can be rejected.

As the variables gender and stra were not statistically significant, I dropped the gender variable first in the next model.

Regression model with one variable

# create a second regression model
model2 <- lm(points ~ attitude + stra, data = learning2014)

# print out a summary of the model
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The attitude variable is still statistically highly significant. The attitude variable is positively related to the exam points, with 3.5 increase with one unit change in attitude for the better conditional on the stra variable. The intercept of this model means that if attitude would be 0, exam points would be 11.6 according to this model. What is interesting, is that when gender variable is removed from the model, the stra variable becomes significant at 10 percent level. It has a coefficient 0.9, meaning that one unit increase in strategic learning is related to a 0.9 increase in exam points, conditional on the attitude variable.

The multiple R-squared is 0.20, which means that the two variables attitude and stra together explain 20 percent of the variation in exam points.

Compared to the last model, the F-statistic has risen as well, and the p-value is smaller.

Diagnostic plots

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(model2,which = c(1,2,5))

The first plot that compares the residuals with the fitted values shows that the points are spread around quite randomly, except for a few outliers, so the model is pretty appropriate. The plot does not indicate non-linear trends or non-constant variance.

The Q-Q-plot shows that the residuals are somewhat normally distributed. However, the lower tail is more heavy, so the values are larger there than would be expected and the upper tail is lighter, so values are smaller there than expected.

The residuals vs leverage plot shows if there are any outliers that would be significant for the model. The Cook’s distance curved lines don’t show in this plot, so the outliers aren’t too disturbing and there aren’t any points that would have high residuals and too much leverage at the same time.


Logistic regression

date()
## [1] "Fri Dec  2 20:15:14 2022"

Read the data

library(tidyverse)
# read in the data that was created
alc <- read.csv(file.path(".", "data", "alc.csv"))

# print names of the variables
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This data set is constructed from a secondary school student questionnaire of Portuguese schools. As the names of the variables indicate, the questions cover topics such as school and studies, but have also social and demographic aspects. The alc data set is combined from two data sets dealing with the students’ performance in math and Portuguese language.

Relationships between high/low alcohol consumption

The binary variable address is an interesting one, I would predict that when living in urban areas high alcohol use would be more probable, as there are more options to use alcohol than for students living in rural areas.

Another variable that I want to look at is the studytime one (weekly study time, divided into 4 groups, where 4 is the highest amount spent in studying), to see if spending more time studying is negatively related to high alcohol use, as the student is spending more time with studies and is possibly more committed to school.

I would hypothesize also that goout variable, telling how much the student goes out with friends (1-5 scale where 5 is very high) is positively linked to high alcohol use, as alcohol is usually a socially related issue.

Lastly I wanted to include the more continuous variable absences, which measures the number of school absences (0-93). I would hypothesize that more absences are positively linked to high alcohol use, as drinking a lot might affect the student’s attendance at school.

Explore variables’ distributions and relationships with alcohol consumption

First the address variable

library(ggplot2)

# set the background white
theme_set(theme_bw())

# bar plots of address variable
g1 <-  ggplot(data = alc, aes(x = address, fill = high_use)) + 
  geom_bar() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("paleturquoise3", "paleturquoise4"))

g2 <- ggplot(data = alc, aes(x = address, fill = high_use)) + 
  geom_bar(position = "fill") +
  ylab("proportion") +
  scale_fill_manual(values = c("paleturquoise3", "paleturquoise4"))

# combine the two plots
library(patchwork)
g1 | g2

As can be expected, most of the students live in urban areas, and so more people have high use of alcohol levels in urban areas, but contrary to what I thought relatively more people have high use of alcohol in rural areas.

Next the studytime variable:

# studytime variable
g3 <-  ggplot(data = alc, aes(x = studytime, fill = high_use)) + 
  geom_bar() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("plum3", "plum4"))

g4 <- ggplot(data = alc, aes(x = studytime, fill = high_use)) + 
  geom_bar(position = "fill") +
  ylab("proportion") +
  scale_fill_manual(values = c("plum3", "plum4"))

# combine the two plots
library(patchwork)
g3 | g4

As I thought, less studytime has higher proportion of students who have high use of alcohol. This is clear for both absolutely and proportionally.

Next the goout variable:

# the going out variable

g5 <-  ggplot(data = alc, aes(x = goout, fill = high_use)) + 
  geom_bar() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("palevioletred3", "palevioletred4"))

g6 <- ggplot(data = alc, aes(x = goout, fill = high_use)) + 
  geom_bar(position = "fill") +
  ylab("proportion") +
  scale_fill_manual(values = c("palevioletred3", "palevioletred4"))

# combine the two plots
library(patchwork)
g5 | g6

Here the high use of alcohol increases with the amount of going out with friends, and the pattern is very clear.

And lastly the absences variable:

# absences
g7 <- ggplot(data = alc, aes(x = absences, fill = high_use)) + 
  geom_bar() +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("pink3", "pink4"))

g8 <- ggplot(data = alc, aes(x = absences, fill = high_use)) + 
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("pink3", "pink4")) +
  ylab("proportion")

# combine the two plots
library(patchwork)
g7 | g8

Here it is also clear that increase in absences is linked to an increase in high use of alcohol as well, but there are also some outliers with few students who have a very high amount of absences.

Looking at the absences variable closer with a box plot:

# box plot of absences with color by address
ggplot(data = alc, aes(x = high_use, y = absences, col = address)) + 
  geom_boxplot() +
  scale_color_manual(values = c("violetred", "violetred4"))

The median of absences is the same whether student is form rural or urban area when there is no high use of alcohol, but the median for absences is slightly higher for students from rural areas if they belong to the high use of alcohol group.

Next few cross-tabulations. First I wanted to add a table that includes address, the mean of going out and of course high use of alcohol.

# table of address, high use of alcohol and the mean of going out
alc %>% group_by(address, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## `summarise()` has grouped output by 'address'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   address [2]
##   address high_use count mean_goout
##   <chr>   <lgl>    <int>      <dbl>
## 1 R       FALSE       50       2.6 
## 2 R       TRUE        31       3.52
## 3 U       FALSE      209       2.91
## 4 U       TRUE        80       3.81

The mean of going out increases quite a lot when the alcohol use is high for both rural and urban area students. The mean of going out in both cases of alcohol use is higher for students who live in urban areas compared to rural area students, as would be expected.

Another cross-tabulation of the mean of studytime and mean of absences with regards to alcohol use:

# table of address, high use of alcohol and the mean of going out
alc %>% group_by(high_use) %>% summarise(count = n(), mean_studytime = mean(studytime), mean_absences = mean(absences))
## # A tibble: 2 × 4
##   high_use count mean_studytime mean_absences
##   <lgl>    <int>          <dbl>         <dbl>
## 1 FALSE      259           2.16          3.71
## 2 TRUE       111           1.77          6.38

Those students who have high use of alcohol also have lower mean of studytime and higher mean of absences compared to those who do not have high use of alcohol.

Logistic regression of the chosen variables

# simple model with only the absences as an explanatory variable
model <- glm(high_use ~ absences, data = alc, family = "binomial")

#summary of the simple model
summary(model)
## 
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3598  -0.8209  -0.7318   1.2658   1.7419  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.26945    0.16094  -7.888 3.08e-15 ***
## absences     0.08867    0.02317   3.827  0.00013 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 434.52  on 368  degrees of freedom
## AIC: 438.52
## 
## Number of Fisher Scoring iterations: 4
####

# model with all the chosen variables: absences, studytime, goout and address
model1 <- glm(high_use ~ absences + studytime + goout + address, data = alc, family = "binomial")

# print out a summary of the model
summary(model1)
## 
## Call:
## glm(formula = high_use ~ absences + studytime + goout + address, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9982  -0.7761  -0.4950   0.8461   2.3232  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.92997    0.56532  -3.414 0.000640 ***
## absences     0.06687    0.02221   3.011 0.002600 ** 
## studytime   -0.59099    0.16862  -3.505 0.000457 ***
## goout        0.75392    0.12061   6.251 4.09e-10 ***
## addressU    -0.73038    0.30408  -2.402 0.016308 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 371.44  on 365  degrees of freedom
## AIC: 381.44
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model1)
## (Intercept)    absences   studytime       goout    addressU 
## -1.92997318  0.06687033 -0.59099384  0.75391575 -0.73038272
# compute odds ratios (OR)
OR <- coef(model1) %>% exp

# compute confidence intervals (CI)
CI <- confint(model1) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR     2.5 %    97.5 %
## (Intercept) 0.1451521 0.0466309 0.4305313
## absences    1.0691568 1.0250338 1.1197230
## studytime   0.5537766 0.3936409 0.7639234
## goout       2.1253059 1.6882118 2.7118282
## addressU    0.4817246 0.2648416 0.8757487

I looked at first a simple model, where absences was the only explanatory variable, and the AIC was 438.52. After adding the three other variables, the AIC had decreased to 381.44, so adding the other variables has improved the model.

When looking at the summary of the fitted model with all the chosen variables, all the coefficients are statistically significant at the 5 % level. As the exploration of the variables earlier implied, the relationship is positive with absences and goout, so increase in these variables will increase the odds of the student having high use of alcohol. Studytime and address on the other hand decrease the odds of having high use of alcohol.

Looking at the coefficients more closer, absences has an odds ratio of 1.069, meaning that a unit increase in absences increase the odds of the student having high use of alcohol by 6.9% keeping other variables constant, and the confidence interval is between 2.5% and 11.97%. This isn’t as large an effect, as I might have expected.

Studytime has a coefficient that is less than one, 0.55, so when studytime increases by one more unit, it decreases the odds of the student having high use of alcohol by 45% when other variables are constant, and the 95% CI is 61% - 24% decrease in odds. Studytime has an opposite effect when increasing, compared to absences.

The goout variable has an odds ratio of 2.125, meaning that a unit increase in goout increases the odds of the student having high use of alcohol by 112.5% again adjusting for the other variables in the model. The 95% CI is 68.8% - 171%. So going out has quite a large effect, which was apparent when plotting it in the beginning.

The address variable has an odds ratio of 0.48, meaning that if the student lives in an urban area, it decreases the odds of the student having high use of alcohol by 52%, when adjusting for the other variables. The negative effect has a 95% CI between 74% - 13%. This is quite surprising as in the beginning I wasn’t even sure of the direction of the effect.

Explore the predictive power of you model

First a 2x2 cross tabulation of predictions versus the actual values:

# fit the model
model1 <- glm(high_use ~ absences + studytime + goout + address, data = alc, family = "binomial")

# predictions
library(dplyr)
alc <- mutate(alc, probability = predict(model1, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)


# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   237   22
##    TRUE     63   48
# initialize a plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability, y = high_use, col = prediction)) +
  geom_point() +
  scale_color_manual(values = c("skyblue2", "skyblue4"))

This model predicts high use incorrectly for 22 students, and low use incorrectly for 63 students.

Next the the total proportion of inaccurately classified individuals, the training error.

# define a loss function (mean prediction error)

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297

The training error is 0.2297, so on average ~23% of the predictions are wrong. The training error is below 0.5, so it is better than randomly guessing.

Perform 10-fold cross-validation on your model

#K-fold cross-validation, K=10
library(boot)
## Warning: package 'boot' was built under R version 4.2.2
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model1, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2324324

One set of the 10-fold cross-validation gives the prediction error of 0.2378, so 23.78% of the predictions are wrong. This is slightly smaller than the model in the exercise set had, 0.26, so the model that I have used has a bit better test set performance.


Clustering and classification

Load the data

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
## 
##     area
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data
data("Boston")

# explore the dataset
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

There are 506 observations and 14 variables in this data set from MASS-package. The data is formed from housing values in suburbs of Boston. For example the variable crim tells the per capita crime rate by town.

Graphical overview of the data

library(GGally)
library(ggplot2)
library(reshape2)
## Warning: package 'reshape2' was built under R version 4.2.2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.2.2
## corrplot 0.92 loaded
#summary of the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# draw a scatter plot matrix of the variables
pairs(Boston)

# the distribution of all the variables
ggplot(data = melt(Boston), aes(x = value)) + 
stat_density() + 
facet_wrap(~variable, scales = "free")
## No id variables; using all as measure variables

# correlations between the variables
cor_matrix <- cor(Boston) %>% 
  round(., digits = 2)

corrplot.mixed(cor_matrix)

Summaries of the variables show that the scale varies a lot for the variables.

When looking at the distributions of the variables, only the variable rm (average number of rooms per dwelling) is close to normal distribution and medv (median value of owner-occupied homes in 1000s) is somewhat normal. The variable of interest, crim, has most of the observations at the left tail as does the variable zn (proportion of residential land zoned for lots over 25,000 sq.ft). The reverse is true for the variable black (1000(Bk−0.63)^2 where Bk is the proportion of blacks by town). There are also few variables that have two peaks at their distribution; indus (proportion of non-retail business acres per town), rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000). Rest of the variables are skewed to left or right.

Finally, when looking at the correlation plot, crim correlates (positively) the most with rad. Largest correlations overall can be found with nox and dis (-0.77) and with rad and tax(0.91). It seems like indus, nox, dis, rad and tax correlate the most with other variables. On the other hand the variable chas (Charles River dummy variable) doesn’t really correlate with any variable.

Standardizing data, creating a categorical variable and forming a train and test set

First standardize the data set

# use the scale function
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crim <- as.numeric(boston_scaled$crim)

# summaries of the scaled data
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

The initial data had very wide range of values for each variable and the sclales varied a lot depending on the variable, so standardizing has normalized the range of the values.

Next the creation of a factor variable form the crim variable:

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high") ,include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

And finally the creation of train and test sets:

# Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Fit the linear discriminant analysis on the train set

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2574257 0.2574257 0.2450495 
## 
## Group means:
##                   zn      indus         chas        nox          rm        age
## low       0.84659238 -0.9112085 -0.069385757 -0.8670377  0.44659926 -0.8471403
## med_low  -0.04259604 -0.2947889 -0.007331936 -0.5863599 -0.16649258 -0.3624240
## med_high -0.39488943  0.1812463  0.144094996  0.3707253  0.07461148  0.4091948
## high     -0.48724019  1.0171737 -0.033716932  1.0646188 -0.40890129  0.8021871
##                 dis        rad        tax    ptratio       black       lstat
## low       0.8631330 -0.7012664 -0.7422883 -0.4380328  0.38254079 -0.76480462
## med_low   0.4029892 -0.5434660 -0.4910847 -0.1127027  0.31894863 -0.11797774
## med_high -0.3402179 -0.4264107 -0.3255768 -0.2845848  0.08619689  0.04267226
## high     -0.8398905  1.6375616  1.5136504  0.7801170 -0.68631972  0.90189282
##                 medv
## low       0.52113870
## med_low  -0.01893558
## med_high  0.17322357
## high     -0.71591247
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.11920185  0.662912510 -0.89658020
## indus    0.03774394 -0.254805832  0.46456184
## chas    -0.01712695 -0.006820378  0.10108560
## nox      0.36277581 -0.835307200 -1.33477540
## rm       0.03368001 -0.049916434 -0.23416133
## age      0.22727608 -0.298372770 -0.06669981
## dis     -0.08643316 -0.261482039  0.19456640
## rad      3.37639522  0.875561681  0.03442656
## tax      0.04535651  0.085624137  0.31381445
## ptratio  0.12921175 -0.052490314 -0.36087910
## black   -0.08881825  0.066398063  0.11601822
## lstat    0.21345984 -0.213695910  0.38769783
## medv     0.07439210 -0.507072653 -0.21613808
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9568 0.0325 0.0107
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "navy", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 2)

Predict the classes with the LDA model on the test data

# save the crime categories from the test set
# then remove the categorical crime variable from the test dataset
correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Next predict the classes with the LDA model on the test data and cross tabulate the results with the crime categories from the test set.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       21       7        2    0
##   med_low    3      12        7    0
##   med_high   0       7       13    2
##   high       0       0        0   28

The LDA model has predicted almost all the observations belonging to the high class correctly. On the lower end, there are few observations that have been predicted to the med_low instead of the correct low class. There is more variability in the predictions for the medium classes. The prediction for med_low has wrongly predicted for low class almost as much observations. So the LDA model has more difficulties in predicting the observations belonging in the middle than on the tails.

Investigate what is the optimal number of clusters

Reload the Boston dataset and standardize the dataset:

library(MASS)
data("Boston")

# use the scale function
boston_scaled2 <- as.data.frame(scale(Boston))
boston_scaled2$crim <- as.numeric(boston_scaled2$crim)

Calculate the distances between the observations

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again:

set.seed(17)

# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# optimal number of clusters seems to be 2, as there is a large drop at that point.
# k-means clustering with two clusters
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = c("steelblue3", "sienna3")[km$cluster])

Looking at the pairs function it appears that one of the clusters has always absorbed the observations on one end or direction, so that two clusters looks quite appropriate when comparing the plots of all the variables against each other.

Perform k-means on the original Boston data with some reasonable number of clusters

library(MASS)
data("Boston")

set.seed(19)

# use the scale function to standardize the data
boston_scaled2 <- as.data.frame(scale(Boston))
boston_scaled2$crim <- as.numeric(boston_scaled2$crim)


# k-means clustering with 4 clusters
km <- kmeans(boston_scaled2, centers = 4)

# linear discriminant analysis
lda.fit2 <- lda(km$cluster ~ ., data = boston_scaled2)

# print the lda.fit object
lda.fit2
## Call:
## lda(km$cluster ~ ., data = boston_scaled2)
## 
## Prior probabilities of groups:
##          1          2          3          4 
## 0.39920949 0.31818182 0.05731225 0.22529644 
## 
## Group means:
##         crim         zn        indus        chas          nox         rm
## 1 -0.3554389 -0.4039269 -0.004726028  0.11748284  0.009509773 -0.2448635
## 2 -0.4071151  0.9395565 -0.951448942 -0.12560484 -0.934654735  0.6775631
## 3  3.0022987 -0.4872402  1.014994623 -0.27232907  1.059334474 -1.3064650
## 4  0.4410309 -0.4872402  1.093886784  0.03849464  1.033664373 -0.1906821
##          age       dis        rad        tax     ptratio      black       lstat
## 1  0.3085558 -0.225942 -0.5764965 -0.5036322 -0.09996773  0.2544955  0.08303063
## 2 -1.0699238  1.023927 -0.5966711 -0.6791796 -0.53145201  0.3578780 -0.86641188
## 3  0.9805356 -1.048472  1.6596029  1.5294129  0.80577843 -1.1906614  1.87087595
## 4  0.7148590 -0.779002  1.4419986  1.4625319  0.72271650 -0.6534850  0.60056774
##          medv
## 1 -0.09324226
## 2  0.72802972
## 3 -1.31020021
## 4 -0.52966703
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## crim     0.35322461  0.307758951 -1.58997833
## zn       0.09849662 -0.330566750 -0.14199716
## indus    0.27720773 -0.031523290  0.36923953
## chas     0.03859298  0.112548411  0.06821859
## nox      0.07110504 -0.230199232  0.17319460
## rm      -0.27410931 -0.568297722  0.31638241
## age      0.16778369  0.815075738  0.30710945
## dis     -0.28454054 -0.606949418 -0.07519390
## rad      1.75298697 -0.745321060  0.47791025
## tax      1.03317493 -0.661045609  0.17536769
## ptratio  0.16021535 -0.003247889  0.05098139
## black   -0.19237795  0.050874618  0.06252285
## lstat    0.27297507  0.032086428 -0.60051563
## medv     0.09553734 -0.204714316 -0.54447145
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.8386 0.0934 0.0680
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "navy", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km$cluster)

# plot the lda results
plot(lda.fit2, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit2, myscale = 3)

There are 4 distinct clusters that are somewhat separate, but the 4th cluster has some of the observations in the middle of the plot, not clearly belonging to any cluster. Clusters 1 and 2 are close to each other as are clusters 3 and 4.

It appears that rad, tax, age, dis and rm have the highest influence in separating the clusters. Looking at the arrows, tax and rad seem to be explaining more for cluster 4, and rm and dis more for cluster 2 for example.

Create a 3D plot

# run the given code 

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
#first plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
# target classes as numeric
classes <- as.numeric(train$crime)

# color by classes
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes)
# k-means clustering
km <- kmeans(model_predictors, centers = 4)

# target classes as numeric
classes2 <- as.numeric(km$cluster)

# color by km clusters
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = classes2)

All the plots have the same observations with most of the observations in a bigger groups and then a separate smaller group. The second plot that has the colors determined by crime classes has the separate smaller group belonging to one crime class. The third plot with the clusters has also the same smaller group in one cluster, but the lines between the clusters change a bit compared to the observations belonging to the crime classes.


Dimensionality reduction techniques

graphical overview of the data